Abstract

In highly parallel message routing networks, it is sometimes desirable
to concentrate relatively few messages on many wires onto fewer
wires. We have designed a VLSI chip for this purpose which is capable
of concentrating bit-serial messages quickly. This hyperconcentrator
switch has a highly regular layout using ratioed nMOS and takes
advantage of the relatively fast performance of large fan-in NOR gates
in this technology. A signal incurs exactly 2 lg n gate delays through
the switch, where n is the number of inputs to the circuit. The
architecture generalizes to domino CMOS as well.


1  Introduction 

The problem of concentrating relatively few communications on many input
lines onto a lesser number of output lines must be solved in communication
networks of all kinds. In many parallel computing systems, communications
are packaged into messages which are routed among the processors. This
paper presents a design for a VLSI implementation of a fast concentrator
switch suitable for routing bit-serial messages in a parallel supercomputer.

This research was supported in part by the Defense Advanced Research Projects
Agency under Contract N00014-80-C-0622. Tom Cormen is supported in part by a
National Science Foundation Fellowship. Charles Leiserson is supported in part
by an NSF Presidential Young Investigator Award.

An n-by-m concentrator switch has n input wires X_1, X_2, ..., X_n and
m < n output wires Y_1, Y_2, ..., Y_m. The switch can establish m disjoint
electrical paths from any set of m input wires to the m output wires. A
concentrator switch always routes as many messages as possible.
Specifically, whenever k out of the n input wires of an n-by-m concentrator
switch carry messages, one of the following is true:

•  If k ≤ m, then an electrical path is established from each input wire
which contains a message to an output wire.

•  If k ≥ m, then each output wire has an electrical path established
from an input wire which contains a message.

When  k  >  m,  some  messages  cannot  be  successfully  routed,  in  which  case 
we  say  the  switch  is  congested.  Typical  ways  of  handling  unsuccessfully 
routed  messages  in  a  routing  network  are  to  buffer  them,  to  misroute  them, 
or  to  simply  drop  them  and  rely  on  a  higher-level  acknowledgment  protocol 
to  detect  this  situation  and  resend  them.  The  switch  design  in  this  paper 
is  compatible  with  any  of  these  congestion  control  methods. 

One way to create a concentrator switch is with a hyperconcentrator
switch. An n-by-n hyperconcentrator switch¹ has n input wires
X_1, X_2, ..., X_n and n output wires Y_1, Y_2, ..., Y_n. The switch can
establish disjoint electrical paths from any set of k input wires, for any
1 ≤ k ≤ n, to the first k output wires Y_1, Y_2, ..., Y_k. In other words, we
route the k messages to the first k output wires. We can make any n-by-m
concentrator switch from an n-by-n hyperconcentrator switch by simply choosing
the first m output wires of the hyperconcentrator switch, Y_1, Y_2, ..., Y_m, as
the m output wires of the concentrator switch.
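The two definitions above can be illustrated at the level of valid bits alone. The following Python sketch is only a behavioral model under that assumption; the function names `hyperconcentrate` and `concentrate` are hypothetical, and nothing here reflects the circuit itself.

```python
def hyperconcentrate(valid_bits):
    """Output valid bits of an n-by-n hyperconcentrator: the k ones come first."""
    k = sum(valid_bits)
    return [1] * k + [0] * (len(valid_bits) - k)


def concentrate(valid_bits, m):
    """n-by-m concentrator: keep only the first m hyperconcentrator outputs."""
    return hyperconcentrate(valid_bits)[:m]


inputs = [0, 1, 0, 0, 1, 1, 0, 0]    # k = 3 messages on n = 8 wires
print(hyperconcentrate(inputs))       # [1, 1, 1, 0, 0, 0, 0, 0]
print(concentrate(inputs, 4))         # [1, 1, 1, 0]  (k <= m, every message routed)
print(concentrate(inputs, 2))         # [1, 1]        (k > m, the switch is congested)
```

In the last call one of the three messages has no output wire, which is the congested case described earlier.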

A hyperconcentrator switch can be implemented using a sorting network
[8, pp. 220-246]. The inputs to the sorting network are 1's and 0's,
representing the presence or absence of messages on the input wires to the
switch. The sorting of the 1's and 0's, with 1's before 0's, causes the k
input messages to occupy the first k outputs.

Many sorting networks, such as Batcher's bitonic sort [8, pp. 232-233],
employ the technique of recursive merging. A problem of size n is divided
into two problems of size n/2, which are recursively solved in parallel. The
two sorted sets are then merged to produce the solution to the original
problem. The recursion requires ⌈lg n⌉ levels,² and since each merge step
can be performed in O(lg n) time in parallel, the total time to sort n values
is O(lg² n). Sorting networks of depth O(lg n) are known [1], but they
are impractical to use as hyperconcentrator switches because of the large
associated constant.

The n-by-n hyperconcentrator switch presented in this paper also uses
recursive merging, but by taking advantage of the relatively fast
performance of high fan-in NOR gates in nMOS technology, each merge takes
only 2 gate delays. A signal therefore incurs exactly 2⌈lg n⌉ gate delays in
passing through the switch. The switch has a simple design and a regular
layout in both ratioed nMOS and domino CMOS technologies. Unlike many
concentrator switches in the literature [11,12,13], our switch sets itself up
“on-line” when messages are presented to it.

¹ The terminology is drawn from [15].

² We use the notation lg n to denote log_2 n.

Figure 1: An nMOS layout of a 32-by-32 hyperconcentrator switch. The recursive
nature of the switch can easily be seen. This implementation includes superbuffers
where needed to provide enough drive for high fan-out signals.

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  covers 
some  basic  terminology  and  describes  the  message  format  and  timing  model 
upon  which  the  switch  is  based.  Section  3  discusses  the  merge  box,  the 
basic  building  block  of  our  switch.  Section  4  shows  how  to  use  merge 
boxes  to  implement  the  hyperconcentrator  switch  and  describes  an  nMOS 
implementation.  Section  5  addresses  issues  that  arise  when  implementing 
the  hyperconcentrator  architecture  in  domino  CMOS  or  other  precharged 
MOS  disciplines.  Section  6  covers  some  applications  which  benefit  from  the 
switch.  Finally,  Section  7  contains  further  remarks  about  the  switch. 

2  Preliminaries 

In this section, we define some basic terminology and notational
conventions and present the message format and timing model assumed by the
hyperconcentrator switch design.

We  shall  adopt  some  notational  conventions  to  ease  the  exposition  in  the 
following sections. Bit and boolean values are denoted by “1” and “0”, or
by “high” and “low”, for TRUE and FALSE respectively. Uppercase symbols
denote  wire  names  and  lowercase  symbols  denote  integer  values.  We  shall 
also  use  uppercase  symbols  to  denote  bit  values  on  the  wires  they  name 
when  the  usage  is  unambiguous.  Wire  names  will  usually  have  subscripts. 

We  assume  that  the  hyperconcentrator  switch  routes  bit-serial  messages. 
Each  message  is  formed  by  a  stream  of  bits  arriving  at  a  wire  at  the  rate 
of  one  bit  per  clock  cycle.  The  first  bit  of  each  message  that  arrives  at  an 
input  wire  is  the  valid  bit,  indicating  whether  subsequent  bits  arriving  on 
that  wire  form  a  valid  message  or  an  invalid  message.  The  bit  sequence 
following  a  valid  bit  of  1  forms  a  valid  message,  which  we  would  like  to  be 
routed from an input wire to an output wire of the switch. From there it
may pass through the remainder of the routing network. A valid bit of 0
indicates an invalid message, which does not need to be routed to an output
wire. We assume that in an invalid message, not only is the valid bit 0, but
so are all the remaining bits in the message.³

The  valid  bits  all  arrive  at  the  input  wires  of  the  hyperconcentrator 
switch  during  the  same  clock  cycle,  which  we  call  setup.  An  external  control 
line  signals  setup.  Message  bits  entering  through  input  wires  at  cycles  after 
setup  follow  the  electrical  paths  in  the  switch  that  are  established  during 
setup. 

3  The  Merge  Box 

This  section  presents  the  design  of  the  merge  box,  the  key  portion  of  the 
hyperconcentrator switch architecture. The hyperconcentrator switch
consists of many merge boxes, of various sizes, connected as shown in the next
section.  The  design  exploits  the  fast  performance  of  large  fan-in  NOR  gates 
in  nMOS  technology,  as  a  PLA  does,  to  merge  two  sets  of  messages  of  any 
size  in  only  two  gate  delays.  The  merge  box  design  presented  in  this  section 
uses  ratioed  nMOS  technology  and  no  pass  transistors. 

A merge box merges two sets of messages, each set sorted by their valid
bits, into one sorted set of messages. A merge box of size 2m, where m is a
power of 2, has two sets of input wires A_1, A_2, ..., A_m and B_1, B_2, ..., B_m
and one set of output wires C_1, C_2, ..., C_{2m}. We assume that the
lower-numbered wires of both the A and B input sets carry valid messages and
that the higher-numbered wires of both the A and B input sets carry invalid
messages. That is, if we let p and q be the number of valid messages entering
the A and B wire sets respectively, where 0 ≤ p, q ≤ m, we require that
the valid bits appear on the input wires during setup as follows:

\[
\begin{aligned}
A_1, A_2, \ldots, A_p &= 1 \\
A_{p+1}, A_{p+2}, \ldots, A_m &= 0 \\
B_1, B_2, \ldots, B_q &= 1 \\
B_{q+1}, B_{q+2}, \ldots, B_m &= 0 .
\end{aligned}
\]

³ This assumption is easy to enforce: just AND the valid bit into each
subsequent bit of the message.

Figure 2: The paths taken by valid messages in a merge box. Valid bits are
shown as they enter and leave the merge box. The p valid messages arriving at
input wires A_1, A_2, ..., A_p are routed to output wires C_1, C_2, ..., C_p
respectively. Here, the only A wires with valid messages are A_1 and A_2. These
valid messages are routed to C_1 and C_2 respectively. The q valid messages
arriving at input wires B_1, B_2, ..., B_q head toward C_1, C_2, ..., C_q but are
steered to C_{p+1}, C_{p+2}, ..., C_{p+q}. Here, the valid messages entering
through B_1, B_2, and B_3 are steered to output wires C_3, C_4, and C_5
respectively.

During setup, the merge box establishes disjoint electrical connections
between the p + q input wires with valid messages and the p + q
lower-numbered output wires C_1, C_2, ..., C_{p+q} in a combinational fashion,
as shown in Figure 2. The connections C_1 = A_1, C_2 = A_2, ..., C_p = A_p,
C_{p+1} = B_1, C_{p+2} = B_2, ..., C_{p+q} = B_q are established, and valid bits
appear on the output wires as follows:

\[
\begin{aligned}
C_1, C_2, \ldots, C_{p+q} &= 1 \\
C_{p+q+1}, C_{p+q+2}, \ldots, C_{2m} &= 0 .
\end{aligned}
\]

These connections are maintained during subsequent cycles for the remaining
bits in the message streams to follow.

Figure 3: A merge box of size 8. The input wires are A_1, A_2, A_3, A_4 and
B_1, B_2, B_3, B_4. The output wires are C_1, C_2, ..., C_8. The switch settings
are stored during setup in registers S_1, S_2, S_3, S_4, S_5. Here we have p = 2
and q = 3 during setup. The valid bit values on each A, B, and C wire are shown,
as are the S switch settings. All conducting paths to ground are circled.

Figure 3 is a schematic diagram of a merge box for which m = 4. This
merge box includes eight NOR gates, with diagonal output wires labeled
\bar{C}_1, \bar{C}_2, ..., \bar{C}_8. Each of these NOR gate outputs is inverted
to produce the merge box outputs C_1, C_2, ..., C_8, so we may view the pulling
down of a diagonal wire \bar{C}_i to be equivalent to the corresponding output
C_i being 1. The NOR gates have fan-ins ranging from just one pulldown
circuit (e.g. the gate with output \bar{C}_8) to 5 pulldown circuits (e.g. the
gate with output \bar{C}_4). In general, the NOR gates have fan-ins of up to
m + 1 pulldown circuits. Each pulldown circuit consists of just one or two
transistors, regardless of the size of the merge box, making for fast NOR
gates and low-area pulldowns, even with minimum-sized pullups. As can
be verified from Figure 3, a merge box of size 2m implements the following
function:

\[
C_i =
\begin{cases}
A_i \lor \bigvee_{j=1}^{i} \left( B_j \land S_{i+1-j} \right) & \text{if } 1 \le i \le m \\[4pt]
\bigvee_{j=1}^{2m+1-i} \left( B_{m+1-j} \land S_{i+j-m} \right) & \text{if } m < i \le 2m .
\end{cases}
\]

The switch settings S_1, S_2, S_3, S_4, S_5 are computed and stored in registers
during setup, based on the valid bits appearing at the A and B input wires.
In general, a merge box of size 2m has switch settings S_1, S_2, ..., S_{m+1}.
These  stored  settings  continue  to  be  used  during  subsequent  cycles.  These 
switch  settings  establish  the  electrical  connections  throughout  the  entire 
hyperconcentrator  switch.  Other  than  the  storing  of  the  switch  settings, 
the  operation  of  the  merge  box  is  purely  combinational. 

Let us look at the operation of the merge box during setup. The
lower-numbered A and B input wires have valid bit values of 1, and the
higher-numbered A and B input wires have valid bit values of 0. If input A_i
is 1, then the NOR gate output \bar{C}_i is pulled down by the single transistor
whose gate is A_i. The inverter causes output C_i to be 1. Having input values
A_1, A_2, ..., A_p = 1 thus causes the outputs C_1, C_2, ..., C_p to be 1.

The switch settings S act as steering signals, sending the B values
B_1, B_2, ..., B_m to the output wires C_{p+1}, C_{p+2}, ..., C_{p+m}. The S
values are computed and stored during setup so that only the setting S_{p+1}
is 1, corresponding to input A_{p+1} being the lowest-numbered A with a valid
bit of 0. (If no input wire A_i is 0, then we have p = m, and only switch
S_{m+1} is set to 1.) The S values are defined by the valid bits on the A wires
as follows:

\[
\begin{aligned}
S_1 &= \bar{A}_1 \\
S_i &= A_{i-1} \land \bar{A}_i \quad \text{for } 1 < i \le m \\
S_{m+1} &= A_m .
\end{aligned}
\]
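The switch-setting rule and the merge box function can be checked together with a small logic-level model. This is a sketch, not the circuit: it reads the S definitions with the complement on A_1 and A_i (the reading forced by the requirement that only S_{p+1} be 1), it ignores the NOR/inverter structure and timing, and the function names `switch_settings` and `merge_box` are hypothetical.

```python
def switch_settings(a):
    """S_1..S_{m+1} from the A valid bits; list a holds A_1..A_m."""
    m = len(a)
    s = [1 - a[0]]                                       # S_1 = not A_1
    s += [a[t - 1] & (1 - a[t]) for t in range(1, m)]    # S_i = A_{i-1} and not A_i
    s.append(a[m - 1])                                   # S_{m+1} = A_m
    return s


def merge_box(a, b, s):
    """Outputs C_1..C_2m for inputs A_1..A_m, B_1..B_m and settings S_1..S_{m+1}."""
    m = len(a)
    c = []
    for i in range(1, 2 * m + 1):              # 1-based output index
        out = a[i - 1] if i <= m else 0         # direct pulldown by A_i
        for j in range(1, m + 1):               # B_j is steered by S_{i+1-j}
            k = i + 1 - j
            if 1 <= k <= m + 1:
                out |= b[j - 1] & s[k - 1]
        c.append(out)
    return c


a = [1, 1, 0, 0]                 # p = 2
b = [1, 1, 1, 0]                 # q = 3
s = switch_settings(a)
print(s)                         # [0, 0, 1, 0, 0]  (only S_3 is set)
print(merge_box(a, b, s))        # [1, 1, 1, 1, 1, 0, 0, 0]

# A later cycle: message bits follow the stored settings.
print(merge_box([1, 0, 0, 0], [0, 1, 1, 0], s))   # [1, 0, 0, 1, 1, 0, 0, 0]
```

The single loop over j covers both cases of the output function, since the bound 1 ≤ i+1-j ≤ m+1 selects exactly the B and S pairs that feed output C_i.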



Of the two-transistor pulldown circuits, only column p + 1 may possibly
pull a diagonal wire down to 0, since only switch setting S_{p+1} is high.
Similarly, a diagonal wire \bar{C}_i may be pulled down only by input wire A_i
or the conjunction B_{i-p} ∧ S_{p+1}. The only NOR gate which may be pulled
down by input B_1 has output wire \bar{C}_{p+1}, and in general the only NOR
gate which may be pulled down by input B_i has output wire \bar{C}_{p+i}.

For example, suppose that, as in Figure 3, the input wires have the
following valid bits during setup:

\[
\begin{aligned}
A_1, A_2 &= 1 \\
A_3, A_4 &= 0 \\
B_1, B_2, B_3 &= 1 \\
B_4 &= 0 .
\end{aligned}
\]

Then we have p = 2, q = 3, S_3 = 1, and all other S_i are equal to 0. There are
five valid messages passing through the merge box, and there are exactly
five conducting paths to ground, circled in Figure 3, one for each of the
first five diagonal wires, \bar{C}_1, \bar{C}_2, \bar{C}_3, \bar{C}_4, \bar{C}_5.
These paths to ground cause output values of 1 on the corresponding output
wires C_1, C_2, C_3, C_4, C_5. The remaining three diagonal wires, \bar{C}_6,
\bar{C}_7, \bar{C}_8, are not pulled down to ground by these input values, and
thus the output wires C_6, C_7, C_8 all have the value 0.

Now we look at the message bits that arrive after setup. The switch
settings S were computed and stored during setup, and they remain
unchanged in their registers. Just as during setup, a bit with the value 1
that enters through input wire A_i directly pulls down the diagonal wire
\bar{C}_i, regardless of the S values. A bit with the value 1 that enters
through input wire B_i may pull down only the diagonal wire \bar{C}_{p+i}
because the only switch setting with value 1 is S_{p+1}. The only difference
in the merge box operation between setup and later cycles is that the S
values, which are always used to steer the B values to the appropriate C
outputs, are computed and stored only during setup. In the cycles following
setup, the merge box is a combinational circuit, reading the registers holding
the switch settings S.

Recall that in Section 2 we required that all bits in an invalid message
must be 0. We now can see the reason for this restriction. Suppose that
in our above example, in which we had A_3 = 0 and S_3 = 1 during setup,
we had that at some cycle following setup A_3 = 1 and B_1 = 0. We would
expect C_3 to be 0 in this case, since B_1, which is routed to C_3, is 0. Since
A_3 is 1, however, \bar{C}_3 is pulled down, and C_3 becomes 1. The requirement
that A_3 and A_4 be 0 after setup eliminates spurious pulldowns.
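The spurious pulldown in this example can be reproduced by evaluating the single output C_3 of a size-8 merge box. The sketch below is a logic-level illustration only; the helper name `c3` is hypothetical, and the masking step models the fix suggested in footnote 3 (ANDing the valid bit into every later bit).

```python
def c3(a3, b1, b2, b3, s1, s2, s3):
    """C_3 = A_3 or (B_1 and S_3) or (B_2 and S_2) or (B_3 and S_1)."""
    return a3 | (b1 & s3) | (b2 & s2) | (b3 & s1)


# Settings stored during setup with p = 2: only S_3 is 1.
s1, s2, s3 = 0, 0, 1

# A stray 1 on the invalid wire A_3 overrides the 0 routed from B_1.
spurious = c3(a3=1, b1=0, b2=0, b3=0, s1=s1, s2=s2, s3=s3)

# Masking A_3's later bits with its valid bit (0) removes the stray pulldown.
valid_a3 = 0
masked = c3(a3=valid_a3 & 1, b1=0, b2=0, b3=0, s1=s1, s2=s2, s3=s3)

print(spurious, masked)   # 1 0
```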

4  The  Hyperconcentrator  Switch 

In  this  section,  we  give  the  recursive  construction  for  assembling  merge 
boxes into a hyperconcentrator switch. We also show that a signal incurs
exactly 2⌈lg n⌉ gate delays through the switch.

A hyperconcentrator switch with input wires X_1, X_2, ..., X_n, of which
k contain messages, and output wires Y_1, Y_2, ..., Y_n routes the k valid
messages to the first k output wires Y_1, Y_2, ..., Y_k. Since valid messages
are identified by a valid bit of 1 during setup, a hyperconcentrator switch may
be viewed as a network that sorts 1's and 0's, with 1's before 0's in the
output. The switch is set during setup, with subsequent bits following these
established electrical paths.

We  use  recursive  merging  to  sort  the  messages,  solving  the  subproblems 
at  each  level  of  the  recursion  in  parallel.  By  knowing  in  advance  the  size 
of  the  problem,  we  know  in  advance  exactly  how  sets  will  be  divided  and 
merged.  We  can  thus  build  the  dividing  process  into  the  hardware  and 
successively  merge  larger  sets  of  bits  through  cascades  of  parallel  merge 
boxes. 

Figure 4 shows the organization of a 16-by-16 hyperconcentrator switch.
There are four stages through which the bits cascade, from bottom to top
in the figure. By this construction, an n-by-n hyperconcentrator switch,
composed of ⌈lg n⌉ stages of combinational merge boxes, is itself a
combinational circuit. Signals incur exactly 2⌈lg n⌉ gate delays in an n-by-n
hyperconcentrator switch. During setup, the S switches in each merge box
are computed and stored in registers. These switches establish electrical
paths for messages in each merge box. Since there are no other switches
between merge boxes, the S switches actually establish the paths through the
entire hyperconcentrator switch. As in the individual merge boxes, message
bits that enter after the valid bit follow the established paths through the
hyperconcentrator switch.



Figure 4: A 16-by-16 hyperconcentrator switch with four stages of merge boxes.
Individual merge boxes are oriented as in Figure 2, with input wires A entering
the bottom left, input wires B entering the bottom right, and output wires C
leaving the top left and top right. Messages flow from bottom to top. The output
wires C_1, C_2, ..., C_m of a merge box of size m are the input wires of a merge
box of size 2m, either A_1, A_2, ..., A_m or B_1, B_2, ..., B_m. Valid bit values
are shown entering the first stage and leaving the last stage of the cascade.
The electrical paths established within the merge boxes and the switch during
setup are shown in heavy lines.


The area of this n-by-n hyperconcentrator switch is Θ(n²), which we
show as follows. The area of a merge box of size m is Θ(m²), since it
contains m(m + 1) constant-size pulldown circuits and m + 1 constant-size
registers. The area of an n-by-n hyperconcentrator switch is then given by
the recurrence

\[
A(n) =
\begin{cases}
\Theta(1) & \text{if } n \le 2 \\
2A(n/2) + \Theta(n^2) & \text{if } n > 2 .
\end{cases}
\]

The Θ(n²) term dominates the recurrence, so the solution is A(n) = Θ(n²).
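To see why the quadratic term dominates, the recurrence can be unrolled. The following is a sketch under simplifying assumptions that the text does not make: n is a power of 2 with n ≥ 4, and the quadratic merge-box term is written as c n² for a hypothetical constant c > 0.

```latex
\begin{aligned}
A(n) &= \sum_{i=0}^{\lg n - 2} 2^{i} \, c \left( \frac{n}{2^{i}} \right)^{2}
        + \frac{n}{2}\,\Theta(1) \\
     &= c\,n^{2} \sum_{i=0}^{\lg n - 2} 2^{-i} + \Theta(n) .
\end{aligned}
% The geometric sum satisfies 1 \le \sum_{i=0}^{\lg n - 2} 2^{-i} < 2, hence
% c\,n^{2} \le A(n) - \Theta(n) < 2c\,n^{2}, and so A(n) = \Theta(n^{2}).
```

The top-level merge box alone accounts for at least half of the quadratic term, so the lower stages together cost less than the final stage.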

Since the minimum clock period for the hyperconcentrator switch
increases with the size of the switch, the clock period of a really large
hyperconcentrator switch may be so long that other hardware using the same
clock cannot operate at maximum speed. The clock period of the
hyperconcentrator switch can be bounded by placing pipelining registers after
every sth stage, for some constant s, letting messages propagate through
s stages per clock cycle. A message then requires (lg n)/s clock cycles
to pass through an n-by-n hyperconcentrator switch. The architecture of
the hyperconcentrator switch makes the inclusion of pipelining registers a
straightforward modification.

Figure 1 shows the layout of a 32-by-32 hyperconcentrator switch, using
4 µm nMOS MOSIS design rules. In order to provide enough drive
for  the  pulldown  transistors  of  the  next  stage,  the  inverters  following  the 
NOR  gates  in  each  merge  box  are  actually  inverting  superbuffers.  Timing 
simulations  have  shown  that  the  propagation  delay  through  this  circuit  is 
under  70  nanoseconds  in  the  worst  case,  an  impressive  figure  in  light  of  the 
conservative  technology  being  simulated. 

5  Domino  CMOS  Design 

In this section, we examine issues that arise when we design the
hyperconcentrator chip with a precharged methodology instead of the
level-sensitive approach used in ratioed nMOS design. We will be concerned
exclusively with domino CMOS, which serves as a good example of a precharged
discipline.

We must take care with the inputs to domino CMOS gates. In domino
CMOS design, the outputs of some gates are precharged high during a
precharge phase \bar{\phi}. The pulldown circuit of such a gate must be open
during the precharge phase to prevent the immediate discharging of the gate’s
output node. During an evaluate phase \phi, the pulldown may optionally
close, allowing the discharge of the gate’s output. In a level-sensitive
design methodology, such as ratioed nMOS, the pulldown circuit may close
and then open again during a single clock phase, as long as the circuit is
in its final state long enough for the logic to settle. In a domino CMOS
design, however, if the pulldown circuit closes at any time during the
evaluate phase, the output node may discharge. Even if the pulldown circuit
later settles open during the same evaluate phase, the gate’s output node
incorrectly remains low. To avoid this phenomenon of premature discharging,
domino CMOS circuits are designed with all precharged gate inputs
monotonically increasing (having no 1-to-0 transitions) during the evaluate
phase. (Readers desiring more information about domino CMOS design
are referred to [5] and [16].)

The circuit resulting from the straightforward modification of the
ratioed nMOS design to domino CMOS (adding n-channel evaluate transistors
to each pulldown circuit and replacing the depletion mode pullups by
p-channel precharge transistors) is not a well-behaved domino CMOS circuit
during setup. It is well behaved during cycles following setup, however.
The simple inverters at the NOR gate outputs cause each merge box output
wire to implement a monotonically increasing function, since the outputs
are each the OR of AND’s of input values. Since when monotonically
increasing functions are composed, the result is a monotonically increasing
function, the entire hyperconcentrator switch is therefore a well-behaved
domino CMOS circuit after setup.

Unfortunately, during setup not all the inputs to the merge boxes are
computed by monotonically increasing functions. In particular, the switch
settings S, defined by S_i = A_{i-1} ∧ \bar{A}_i, are not monotonically
increasing. If we start with A_{i-1} and A_i at 0 and raise them each to 1, the
value of S_i can go from 0 to 1 and then back to 0:

    A_{i-1}   A_i   S_i
       0       0     0
       1       0     1
       1       1     0

The problem is how to set the S values in the face of their apparent
nonmonotonicity.
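The hazard can be made concrete by replaying the sequence in which A_{i-1} rises before A_i. The sketch below is an idealized model: `domino_output` is a hypothetical helper that treats a precharged node as discharged if its pulldown closes at any moment during evaluate, and it reads S_i with the complement on A_i.

```python
def s_i(a_prev, a_cur):
    """S_i = A_{i-1} AND (NOT A_i)."""
    return a_prev & (1 - a_cur)


def is_monotone_increasing(values):
    """True if the sequence never makes a 1-to-0 transition."""
    return all(x <= y for x, y in zip(values, values[1:]))


def domino_output(pulldown_trace):
    """1 if the precharged node was discharged at any time during evaluate."""
    return int(any(pulldown_trace))


arrival = [(0, 0), (1, 0), (1, 1)]          # A_{i-1} rises first, then A_i
s_trace = [s_i(p, c) for p, c in arrival]

print(s_trace)                               # [0, 1, 0]
print(is_monotone_increasing(s_trace))       # False
print(domino_output(s_trace), s_trace[-1])   # 1 0
```

The last line shows the mismatch: a gate driven by S_i has already discharged even though S_i settles at 0.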

Referring to the schematic diagram in Figure 5, our solution is as
follows. We of course include the p-channel precharge and n-channel evaluate
transistors (shown with input \phi). But the major design change from
ratioed nMOS to domino CMOS is in the values assigned to the S wires
during setup. Suppose once again that during setup we have the following
valid bit values on the input wires:

\[
\begin{aligned}
A_1, A_2, \ldots, A_p &= 1 \\
A_{p+1}, A_{p+2}, \ldots, A_m &= 0 \\
B_1, B_2, \ldots, B_q &= 1 \\
B_{q+1}, B_{q+2}, \ldots, B_m &= 0 .
\end{aligned}
\]

Then during setup, instead of setting only S_{p+1} to 1 (as in the ratioed
nMOS design), we set

\[
\begin{aligned}
S_1, S_2, \ldots, S_{p+1} &= 1 \\
S_{p+2}, S_{p+3}, \ldots, S_{m+1} &= 0 .
\end{aligned}
\]

We still load the registers, which we now name R_1, R_2, ..., R_{m+1}, only
during setup, so that only R_{p+1} is 1, as in the ratioed nMOS version:

\[
\begin{aligned}
R_1 &= \bar{A}_1 \\
R_i &= A_{i-1} \land \bar{A}_i \quad \text{for } 1 < i \le m \\
R_{m+1} &= A_m .
\end{aligned}
\]

After setup, we set the values S_i = R_i for 1 ≤ i ≤ m + 1, so the S wires
then take on the values stored in the registers during setup. Only S_{p+1} is
1, again just as in the ratioed nMOS design.

Figure 5: A domino CMOS merge box of size 8. During setup, the S wires have
the values S_1, S_2, ..., S_{p+1} = 1, but the registers are set as in the
ratioed nMOS design, with only register R_{p+1} receiving the value 1. After
setup, the S wires get their values from the registers, so only S_{p+1} = 1, as
in the ratioed nMOS case. The conducting paths to ground during the evaluate
phase of setup are circled for the case of p = 2 and q = 3.


We shall now see why this construction works. We first look at setup.
In the precharge phase \bar{\phi} of setup, the diagonal wires \bar{C}_1,
\bar{C}_2, ..., \bar{C}_{2m} are charged high, so the output wires C_1, C_2, ...,
C_{2m} are all low. During the evaluate phase \phi of setup, the wires
\bar{C}_1, \bar{C}_2, ..., \bar{C}_p are pulled down by A_1, A_2, ..., A_p, as in
the ratioed nMOS design. Unlike the ratioed nMOS case, however, we have
S_1, S_2, ..., S_{p+1} = 1 and B_1, B_2, ..., B_q = 1. If we have q > 0, then
the wires \bar{C}_1, \bar{C}_2, ..., \bar{C}_{p+q} (not just \bar{C}_{p+1},
\bar{C}_{p+2}, ..., \bar{C}_{p+q}) are pulled down by the S and B wires during
the evaluate phase \phi of setup. If we have instead q = 0, then no \bar{C}
wires are pulled down by S and B wires. The result is that during the evaluate
phase of setup, the output wires take on the values

\[
\begin{aligned}
C_1, C_2, \ldots, C_{p+q} &= 1 \\
C_{p+q+1}, C_{p+q+2}, \ldots, C_{2m} &= 0 ,
\end{aligned}
\]

as desired. During evaluate phases after setup, the A, B, and S wires take
on the same values as they do in the ratioed nMOS circuit, so the circuit
again works correctly.
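The claim that both S assignments give the same outputs during setup can be checked exhaustively for a small merge box. The sketch below is a logic-level check only. It assumes, as one way to obtain S_1, ..., S_{p+1} = 1, that the setup-time S wires are driven as S_1 = 1 and S_i = A_{i-1}; the text does not spell out this circuit, and the names `setup_s_wires`, `registers`, and `merge` are hypothetical.

```python
def merge(a, b, s):
    # C_i = A_i or OR over j of (B_j and S_{i+1-j}), with 1-based indices
    m = len(a)
    c = []
    for i in range(1, 2 * m + 1):
        out = a[i - 1] if i <= m else 0
        for j in range(max(1, i - m), min(i, m) + 1):
            out |= b[j - 1] & s[i - j]
        c.append(out)
    return c


def setup_s_wires(a):
    # Assumed setup-time drive: S_1 = 1, S_i = A_{i-1}.
    # Each is a monotonically increasing function of the A inputs.
    return [1] + list(a)


def registers(a):
    # R_1 = not A_1, R_i = A_{i-1} and not A_i, R_{m+1} = A_m (one-hot at p + 1)
    return [1 - a[0]] + [a[t - 1] & (1 - a[t]) for t in range(1, len(a))] + [a[-1]]


m = 4
agree = True
for p in range(m + 1):
    for q in range(m + 1):
        a = [1] * p + [0] * (m - p)
        b = [1] * q + [0] * (m - q)
        want = [1] * (p + q) + [0] * (2 * m - p - q)
        agree &= merge(a, b, setup_s_wires(a)) == want    # domino setup values
        agree &= merge(a, b, registers(a)) == want        # one-hot register values
print(agree)   # True
```

For sorted A inputs, `[1] + list(a)` has ones in positions 1 through p + 1, which matches the setup-time S values given above.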

With the A and B inputs monotonically increasing during the evaluate
phase, and with the R registers designed so that their outputs undergo
no 1-to-0 transitions during the evaluate phase, each merge box is a
well-behaved domino CMOS circuit. Since the output of each merge box is
monotonically increasing during the evaluate phase and is the input of a
merge box in the next stage, all we need for the entire hyperconcentrator
switch to be a well-behaved domino CMOS circuit is for the A and B inputs
to the first stage to be monotonically increasing during the evaluate phase.

6  Applications 

This section discusses applications of the hyperconcentrator switch. One
application,  the  one  for  which  the  switch  was  designed,  allows  us  to  use  the 
available  clock  period  more  efficiently  in  bit-serial  routing  networks,  thus 
improving  their  performance.  Another  application  is  its  use  in  building 
a  superconcentrator  switch.  Finally,  we  show  how  the  hyperconcentrator 
switch  can  be  used  as  a  subcircuit  in  building  larger  switches  that  span 
multiple  chips. 


Figure 6: A 2-input, 2-output butterfly node. The selectors and 2-by-1
concentrator switches ensure that both input messages reach output wires if
their address bits specify that they are going in different directions. If the
messages contend for the same output wire, the concentrator switches ensure
that one makes it through. With randomly chosen address bits, we expect 3n/4 of
the n messages to be successfully routed through this node.

Improving  Network  Performance 

We  can  replace  small,  simple  switches  in  a  bit-serial  routing  network  by 
concentrator  switches  to  successfully  route  more  messages  in  a  single  clock 
cycle,  thus  using  the  available  clock  period  more  efficiently.  The  routing 
network  switches  we  shall  consider  route  valid  messages  either  left  or  right, 
based  on  an  address  bit  immediately  following  the  valid  bit.  Such  a  routing 
scheme  is  used,  for  example,  in  a  butterfly  network.  An  address  bit  of 
0  indicates  that  the  valid  message  should  be  routed  to  a  left  output  of  a 
switch,  and  an  address  bit  of  1  indicates  that  the  valid  message  should  be 
routed  to  the  right. 

Consider the 2-input, 2-output butterfly node shown in Figure 6. A
single level of a routing network such as a butterfly would typically have
several such nodes side-by-side. The node contains two simple 2-by-1
concentrator switches, depicted as trapezoids, one with outputs going left and
one with outputs going right. Each simple concentrator switch is preceded
by a selector circuit that, given an input valid bit and an address bit,
produces a new valid bit which is 1 if and only if the input valid bit is 1 and
the address bit matches the output direction of the concentrator switch. If


Figure 7: A generalized butterfly node with n inputs and n outputs, shown here
for n = 8. There are two n-by-n/2 concentrator switches. With randomly chosen
address bits, we expect $n - O(\sqrt{n})$ messages to be successfully routed
through this node.

two  valid  messages  with  equal  address  bits  enter  a  butterfly  node,  only  one 
is  successfully  routed. 
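
The behavior of the simple node can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions, not the circuit itself: the names `selector`, `concentrate_2_by_1`, and `simple_node` are hypothetical, and each message is reduced to a (valid bit, address bit) pair.

```python
from itertools import product

LEFT, RIGHT = 0, 1  # address bit 0 routes left, address bit 1 routes right


def selector(valid, address, direction):
    """New valid bit: 1 iff the input is valid and its address matches `direction`."""
    return 1 if valid == 1 and address == direction else 0


def concentrate_2_by_1(valid_bits):
    """A 2-by-1 concentrator has one output, so it passes at most one valid message."""
    return min(sum(valid_bits), 1)


def simple_node(messages):
    """messages: two (valid, address) pairs. Returns how many messages get through."""
    routed = 0
    for direction in (LEFT, RIGHT):
        selected = [selector(v, a, direction) for v, a in messages]
        routed += concentrate_2_by_1(selected)
    return routed


# Enumerate the four address combinations for two valid messages.
for a0, a1 in product((0, 1), repeat=2):
    print(a0, a1, simple_node([(1, a0), (1, a1)]))
```

Both messages get through only when the address bits differ; when they are equal, exactly one message is routed.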

The  problem  with  this  scheme  is  that  it  does  not  use  the  available  clock 
period  efficiently.  The  simple  node  uses  only  a  few  levels  of  logic,  so  the 
delay  through  it  is  only  a  few  nanoseconds.  But  because  of  the  large  amount 
of  time  required  to  get  signals  on  and  off  chips  in  current  technologies,  we 
might  be  unable  to  distribute  a  clock  with  a  frequency  high  enough  to  match 
the  short  delay  of  this  node.  In  fact,  the  clock  period  we  can  distribute  is 
typically  at  least  an  order  of  magnitude  greater  than  the  delay  through  this 
node.  This  node  therefore  performs  no  useful  work  in  at  least  90  percent 
of  each  clock  cycle. 

Now consider the generalized n-input, n-output butterfly node shown
for n = 8 in Figure 7. Like four simple butterfly nodes of Figure 6 laid
side-by-side, it has a total of 8 input wires and 8 output wires, with 4
outputs going left and 4 outputs going right. But here we use two
n-by-n/2 concentrator switches, one with outputs only going left and one with
outputs only going right.

The  advantage  held  by  the  larger  node  is  that  at  the  same  clock  speed 
as  the  simple  nodes,  it  can  successfully  route  more  valid  messages  in  each 
clock cycle. The clock speed remains the same because the additional
delay  introduced  by  the  larger  concentrator  switches  is  just  soaked  up  by 
the  unused  portion  of  the  clock  period.  These  nodes  use  a  larger  portion 
of  the  available  clock  period.  Since  the  simple  nodes  leave  so  much  of 
the  available  clock  period  unused,  we  can  even  scale  these  concentrator 
switches  up  considerably  before  the  delay  introduced  exceeds  the  original 
clock  period. 
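
As a rough illustration of how far the switches might be scaled, the following Python sketch assumes that each concentrator is a hyperconcentrator switch incurring 2 lg n gate delays (the figure given in the abstract) and that selector and pad delays are ignored. The function name `max_inputs` and the timing figures in the usage line are hypothetical and are not taken from the fabricated chip.

```python
def max_inputs(clock_period_ps: int, gate_delay_ps: int) -> int:
    """Largest power-of-two n whose 2*lg(n) gate delays fit in one clock period.

    Both arguments are integer picoseconds so the arithmetic stays exact.
    """
    lg_n = clock_period_ps // (2 * gate_delay_ps)  # need 2 * lg_n * gate_delay <= period
    return 2 ** lg_n


# Hypothetical figures: a 50 ns clock period and a 2 ns gate delay.
print(max_inputs(clock_period_ps=50_000, gate_delay_ps=2_000))
```

Because the delay grows only logarithmically in n, each additional pair of gate delays that fits in the clock period doubles the switch size in this model.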

We  shall  now  show  that  more  valid  messages  are  routed  in  a  single  clock 
cycle by the larger nodes. We assume that a valid message arrives at each
input wire of each node and that the address bit is 0 with probability 1/2,
independent of the address bits of other messages. We shall show that
on average, the small node successfully routes only a constant fraction of
the valid messages, but on average, the larger nodes, with n input wires,
successfully route $n - O(\sqrt{n})$ valid messages. Intuitively, the larger nodes
successfully  route  more  valid  messages  because  they  have  more  freedom  in 
mapping  inputs  to  outputs. 

First consider the 2-input, 2-output node of Figure 6. If the valid
messages have unequal address bits, which occurs with probability 1/2, no valid
messages are lost. If the address bits are equal, which also occurs half the
time, one of the two valid messages is lost. Since the switches act
independently, the probability that a given valid message is lost is
(1/2)(1/2) = 1/4, so we expect that 3/4 of the valid messages are
successfully routed.

Now consider a routing network node like that of Figure 7, but with
n inputs and n outputs. Suppose we have k valid messages with address
bits of 0 (0-messages) and n - k with address bits of 1 (1-messages). If
k > n/2, then k - n/2 of the 0-messages are lost and no 1-messages are
lost. Conversely, if k < n/2, then n/2 - k of the 1-messages are lost
and no 0-messages are lost. Thus, when there are k 0-messages, there are
|k - n/2| valid messages lost. The number of 0-messages, k, is binomially
distributed, with the probability of a valid message being a 0-message equal
to 1/2 and the expected number of 0-messages equal to E(k) = n/2. The
expected number of valid messages lost is E(|k - n/2|). To determine an
upper bound on this value, we note that

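$$
\begin{aligned}
E(|k - n/2|^2) &= E((k - n/2)^2) \\
               &= E((k - E(k))^2) \\
               &= \mathrm{var}(k) \\
               &= n/4 .
\end{aligned}
$$

For any random variable $X$, we have the identity

$$ \mathrm{var}(X) = E(X^2) - (E(X))^2 , $$

which combined with

$$ \mathrm{var}(X) \ge 0 $$

gives us

$$ (E(X))^2 \le E(X^2) $$

or

$$ E(X) \le \sqrt{E(X^2)} . $$

We thus have

$$
\begin{aligned}
E(|k - n/2|) &\le \sqrt{E(|k - n/2|^2)} \\
             &= \sqrt{n}/2 .
\end{aligned}
$$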

The expected number of valid messages lost is therefore $O(\sqrt{n})$ and thus
the expected number of successfully routed valid messages is $n - O(\sqrt{n})$.
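
The bound can be checked numerically. The sketch below computes E(|k - n/2|) exactly from the binomial distribution and prints it next to the bound √n/2; the function name `expected_lost` is hypothetical, and the model assumes, as in the analysis above, that every input carries a valid message with an independent, uniformly random address bit.

```python
from fractions import Fraction
from math import comb, sqrt


def expected_lost(n):
    """Exact E|k - n/2| for k ~ Binomial(n, 1/2), using rational arithmetic."""
    half = Fraction(n, 2)
    return sum(Fraction(comb(n, k), 2 ** n) * abs(k - half) for k in range(n + 1))


# Columns: n, expected messages lost, bound sqrt(n)/2, expected fraction routed.
for n in (2, 8, 32, 128):
    lost = expected_lost(n)
    print(n, float(lost), sqrt(n) / 2, float(1 - lost / n))
```

For n = 2 the expected loss is 1/2 message, which matches the 3/4 success fraction derived for the simple node.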


Superconcentrator  Switches 

Another application of hyperconcentrator switches is in building
superconcentrator switches. An n-by-n superconcentrator switch has n input wires
$X_1, X_2, \ldots, X_n$ and n output wires $Y_1, Y_2, \ldots, Y_n$. For any
1 ≤ k ≤ n, disjoint electrical paths may be established from any set of k
input wires to any arbitrarily chosen set of k output wires. Superconcentrator
switches are useful in fault-tolerant systems. If some of the output wires of a
concentrator switch may be faulty, we can use a superconcentrator switch that
routes signals to only the good output wires.

We can build a superconcentrator switch out of two full-duplex
hyperconcentrator switches $H_F$ and $H_R$, as shown in Figure 8.⁴ (After setup in

⁴This construction is shown in [15].




Figure 8: A superconcentrator switch built out of two hyperconcentrator switches.
The hyperconcentrator switch $H_R$ is set up to connect the first l reverse input
wires $Z_1, Z_2, \ldots, Z_l$ to the l good reverse output wires, which also serve
as the output wires of the superconcentrator switch. The k valid messages are then
routed by the hyperconcentrator switch $H_F$ to the wires $Z_1, Z_2, \ldots, Z_k$
and through the switch $H_R$ to the first k good output wires.


a full-duplex hyperconcentrator switch, signals can travel along the
established paths simultaneously in both forward and reverse directions.
Extending the design of the hyperconcentrator switch to make it full-duplex
is straightforward.) The output wires of the switch $H_F$ (a "forward"
hyperconcentrator switch) feed directly into the reverse input wires of the
full-duplex hyperconcentrator switch $H_R$ (a "reverse" hyperconcentrator
switch). Suppose there are l good output wires of the superconcentrator
switch. Before setup of the superconcentrator switch, the switch $H_R$ sets
up electrical paths from its first l reverse input wires $Z_1, Z_2, \ldots, Z_l$
to the l good reverse output wires. These paths are established by assigning a 1
to each forward input wire of the switch $H_R$ that corresponds to a good
output wire, assigning a 0 to the forward input wires corresponding to faulty
output wires, and running a setup cycle of the switch $H_R$.

Setup of the superconcentrator switch is then just setup of the
hyperconcentrator switch $H_F$. The k valid messages are routed through the switch
$H_F$ to the wires $Z_1, Z_2, \ldots, Z_k$ and then along the reverse paths through
the switch $H_R$ to the first k good output wires.
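
The two-step setup can be modeled behaviorally as follows. This is a sketch under assumptions, not a description of the circuit: the hypothetical helper `hyperconcentrate` assigns outputs in input order (the construction only needs the valid messages to land on the first k outputs), wires are indexed from 0 rather than 1, and messages beyond the number of good outputs are simply dropped.

```python
def hyperconcentrate(valid_bits):
    """Map each input carrying a 1 to the next unused output, in input order.

    Returns a dict {input_index: output_index}.
    """
    ones = [i for i, v in enumerate(valid_bits) if v == 1]
    return {inp: out for out, inp in enumerate(ones)}


def superconcentrate(valid_bits, good_outputs):
    """Return {input_index: good_output_index} for the messages that can be routed."""
    # Pre-setup of H_R: a 1 on each forward input that corresponds to a good output.
    h_r = hyperconcentrate(good_outputs)            # good output wire -> Z index
    z_to_output = {z: out for out, z in h_r.items()}  # reverse paths: Z index -> output
    # Setup of H_F: concentrate the valid messages onto the first Z wires.
    h_f = hyperconcentrate(valid_bits)              # input wire -> Z index
    # Follow each message forward through H_F, then backward through H_R.
    return {inp: z_to_output[z] for inp, z in h_f.items() if z in z_to_output}


valid = [0, 1, 1, 0, 1, 0, 0, 0]   # valid messages on inputs 1, 2, and 4
good = [1, 0, 1, 1, 0, 1, 1, 1]    # outputs 1 and 4 are faulty
print(superconcentrate(valid, good))
```

In this example the three valid messages land on good outputs 0, 2, and 3, the first three good wires, and the faulty outputs are never used.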

Building Large Switches

The hyperconcentrator switch can also be used as a building block in large
concentrators. For example, replacing the comparators in an arbitrary
sorting network by n-by-n hyperconcentrator switches yields a large
hyperconcentrator. (Actually, only the first level of comparators must be replaced
by hyperconcentrator switches; merge boxes suffice at all subsequent levels.)

We have also found that efficient multichip partial concentrator switches
can be built from hyperconcentrator chips. An (n, m, α) partial concentrator
switch has n inputs, m outputs, and a fraction α such that if there are
k valid messages entering the switch, then

• If k ≤ αm, each valid message is routed to an output.

• If k > αm, at least αm valid messages are routed to outputs.

A lightly loaded (n, m, α) partial concentrator switch is similar to an
n-by-m concentrator switch, in that if the number of valid messages to route is
at most αm, then all the valid messages are successfully routed to outputs.
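
The guarantee in this definition can be stated as a small function. The name `guaranteed_routed` is hypothetical, and α is passed as a `Fraction` so that the threshold αm is computed exactly; since the number of routed messages is an integer, "at least αm" is rounded up.

```python
from fractions import Fraction
from math import ceil


def guaranteed_routed(k, m, alpha):
    """Minimum number of the k valid messages that an (n, m, alpha) partial
    concentrator switch is required to route."""
    threshold = alpha * m
    if k <= threshold:
        return k              # lightly loaded: every valid message gets an output
    return ceil(threshold)    # heavily loaded: at least alpha*m messages get outputs


alpha = Fraction(3, 4)
for k in (5, 12, 14):
    print(k, guaranteed_routed(k, m=16, alpha=alpha))
```

With m = 16 and α = 3/4, the switch behaves like a true concentrator for up to 12 valid messages and guarantees 12 routed messages beyond that.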

One multichip partial concentrator switch construction [2,3] is based on
the Revsort two-dimensional mesh sorting algorithm of Schnorr and Shamir
[14] and uses $3\sqrt{n}$ hyperconcentrator chips with $\sqrt{n}$ inputs each. This
construction yields an $(n, m, 1 - O(n^{3/4}/m))$ partial concentrator switch in
three-dimensional volume $O(n^{3/2})$. A signal incurs $3 \lg n + O(1)$ gate delays
in passing through this switch.

Another such construction [3], based on Leighton’s Columnsort
algorithm [9], uses $O(n^{1-\beta})$ hyperconcentrator chips with $\Theta(n^{\beta})$
inputs each, for any 1/2 < β < 1. This construction produces an
$(n, m, 1 - O(n^{2-2\beta}))$ partial concentrator switch in volume
$O(n^{1+\beta})$. A signal incurs $4\beta \lg n + O(1)$ gate delays in passing
through this switch.

One might wonder how to build multichip hyperconcentrator switches.
Partitioning the n-by-n hyperconcentrator switch presented in this paper
among multiple chips with p pins each requires $\Omega((n/p)^2)$ chips, since each
p-pin chip has area $O(p^2)$ and there are $O(n^2)$ components to partition. We
may want to partition a hyperconcentrator switch for two reasons:

1. The $O(n^2)$ area may exceed the available chip area.

2. If the switch is to be packaged by itself on a chip, it may require more
input and output pins than are provided by the packaging technology.

A different n-by-n hyperconcentrator switch design, consisting of a
parallel prefix circuit and a butterfly network [2], can be built in volume
$\Theta(n^{3/2})$ with $O(n \lg n)$ chips and as few as four data pins per chip, but
this switch is not combinational. Although its sequential control is not very
complex, it is not as simple as that of a combinational circuit.

We can build multichip hyperconcentrator switches by extending either
of the above multichip partial concentrator switch designs [3]. By extending
the Revsort-based design, we can build a multichip n-by-n hyperconcentrator
switch that uses $\Theta(\sqrt{n} \lg\lg n)$ chips with $\Theta(\sqrt{n})$ pins each
in volume $O(n^{3/2} \lg\lg n)$, inducing
$4 \lg n \lg\lg n + 8 \lg n + O(\lg\lg n)$ gate delays. An
extension of the Columnsort-based design yields a multichip n-by-n
hyperconcentrator switch that uses $O(n^{1-\beta})$ chips with $O(n^{\beta})$ pins
each in volume $O(n^{1+\beta})$ for any 1/2 < β < 1. A signal incurs
$8\beta \lg n + O(1)$ gate delays in passing through this switch.

7  Conclusion 

In this section, we describe a circuit containing the hyperconcentrator
switch which has been fabricated. We also briefly discuss other
applications of the switch. Finally, we pose some open questions.

We have implemented a 4 μm nMOS 16-by-16 hyperconcentrator switch,
which  was  fabricated  by  MOSIS.  The  chip  contains  programmable  selector 
circuitry  preceding  the  hyperconcentrator  switch  so  that  an  independent 
routing  decision  can  be  made  for  each  input,  as  in  Figures  6  and  7.  Each 
of  the  16  selectors  includes  a  UV  write-enabled  PROM  cell,  described  in 
[4].  The  bit  value  stored  in  each  PROM  cell  is  compared  with  an  address 
bit  in  the  input  message  to  determine  whether  the  message  is  going  in  the 
correct  direction.  The  device  is  awaiting  test. 

The  approach  of  replacing  many  small  routing  nodes  by  fewer  nodes 
with  larger  concentrator  switches  is  used  by  the  cross-omega  network  [7]. 
Part  of  the  cross-omega  network  is  based  on  a  truncated  butterfly  network. 
Single  wires  of  the  butterfly  network  are  replaced  by  bundles  of  32  wires, 
and the simple butterfly network nodes are replaced by nodes like that of
Figure 7, but with 32 inputs, 32 outputs, and two 32-by-16 concentrator
switches.  Fat-trees  serve  as  another  example  of  a  class  of  routing  networks 
that  makes  use  of  concentrator  switches.  The  interested  reader  is  referred 
to  [6]  and  [10]  for  details. 

An open question is for what functions f(p) can we build an
$(\Omega(f(p)), m, 1 - o(p/m))$ partial concentrator switch, given chips with p
pins and using only two stages of chips. The Columnsort-based partial
concentrator construction, for example, gives us $f(p) = p^{2-\epsilon}$ for any
0 < ε < 1. Can we achieve $f(p) = \Omega(p^2)$? In general, how large a function
f(p) can we achieve with k stages?

It  is  natural  to  ask  whether  a  simple  design  for  a  concentrator  switch 
exists  when  we  relax  the  constraint  that  all  the  valid  messages  arrive  at  the 
same  time.  A  crossbar  switch  has  the  capability  of  allowing  valid  messages 
to  come  and  go  at  any  time,  but  switch  setup  can  be  expensive.  It  may  be 
that  a  concentrator  switch  can  be  designed  that  allows  new  messages  to  be 
routed  in  batches  while  preserving  old  connections. 